Please start by reading the task description in exercise_description.pdf.
You will find concentrations of spatial events, first in space and time and then in space only, by means of the density-based clustering algorithm DBSCAN.
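As a minimal illustration of how DBSCAN behaves (on synthetic 2D points, not the exercise data): points in dense regions are grouped into clusters, while isolated points are labelled -1 (noise).

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(42)
# Two dense blobs plus a few scattered points
blob1 = rng.normal([0, 0], 0.1, size=(30, 2))
blob2 = rng.normal([3, 3], 0.1, size=(30, 2))
scatter = rng.uniform(-2, 5, size=(5, 2))
X = np.vstack([blob1, blob2, scatter])

labels = DBSCAN(eps=0.5, min_samples=5).fit(X).labels_
# Expect two clusters (0 and 1) and some noise points (-1)
print(sorted(set(labels)))
```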
conda install -c conda-forge seaborn
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import MDS
from sklearn.preprocessing import MinMaxScaler
import folium as fm
from IPython.display import Markdown, display, HTML
# beautify the screen display
%matplotlib inline
pd.options.display.max_columns = 200
#display(HTML("<style>.container { width:100% !important; }</style>"))
display(HTML("""<style>
.rendered_html tr, .rendered_html th, .rendered_html td { text-align: right; }
.rendered_html :first-child { text-align: left; }
.rendered_html :last-child { text-align: left; }
</style>"""))
# Tweak default output of pyplots
screen_dpi = plt.rcParams['figure.dpi']
# 'figsize' is in inches, so convert the desired default figure size in pixels into inches using the given screen dpi
plt.rcParams["figure.figsize"] = [800/screen_dpi,600/screen_dpi]
# Use in-line date string conversion/parsing; it is significantly more efficient
# than applying pd.to_datetime() to a dataframe column post-loading
def dt_parse(timestamp_str):
return pd.to_datetime(timestamp_str, format='%Y%m%d %H:%M:%S', errors='coerce')
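A quick sanity check of the parser: a string in the expected format parses to a timestamp, and a malformed string is coerced to NaT instead of raising an exception.

```python
import pandas as pd

def dt_parse(timestamp_str):
    return pd.to_datetime(timestamp_str, format='%Y%m%d %H:%M:%S', errors='coerce')

good = dt_parse('20120415 14:30:00')
bad = dt_parse('April 15, 2012')  # wrong format -> NaT, not an exception
print(good, pd.isna(bad))
```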
The data you will use are metadata records of geolocated Flickr photos taken in North America in the years 2007 to 2016 that have keywords related to cherry blossoming in their titles or tags.
The folder includes 3 data sets of different sizes:
cherryblossom_USA_filtered.csv: 10 years of data
cherryblossom_2012-2017.csv: 6 years of data
cherryblossom_2012-2014.csv: 3 years of data
More data slows down clustering and may cause visualization problems on some computers. We suggest starting with the smallest dataset.
#events = pd.read_csv('cherryblossom_2012-2017.csv', sep=',', decimal='.',
#header=0, index_col='PhotoID', parse_dates=['DateTaken'], date_parser=dt_parse)
#events = pd.read_csv('cherryblossom_USA_filtered.csv', sep=',', decimal='.',
#header=0, index_col='PhotoID', parse_dates=['DateTaken'], date_parser=dt_parse)
events = pd.read_csv('cherryblossom_2012-2014.csv', sep=',', decimal='.',
header=0, index_col='PhotoID', parse_dates=['DateTaken'], date_parser=dt_parse)
events.describe()
events.head()
If you receive the following error message:
IOPub data rate exceeded. The notebook server will temporarily stop sending output to the client in order to avoid crashing it. To change this limit, set the config variable `--NotebookApp.iopub_data_rate_limit`.
please re-start Jupyter Notebook with an extra parameter:
jupyter notebook --NotebookApp.iopub_data_rate_limit=10000000000
lon_range = (-130.60, -52.75)
lat_range = (17.13, 53.65)
m = fm.Map(tiles='cartodbpositron', width='90%', height='90%')
# If you adjusted the notebook display width to be as wide as your screen, the map might get very big.
# Adjust size as desired.
m.fit_bounds([[lat_range[0], lon_range[0]], [lat_range[1], lon_range[1]]])
for id, row in events.iterrows():
fm.CircleMarker((row.latitude, row.longitude), radius=2, color='#0000FF0D', fill=False).add_to(m)
m
from mpl_toolkits.mplot3d import Axes3D
fig = plt.figure(1, figsize=(9, 6))
ax = Axes3D(fig, auto_add_to_figure=False, rect=[0, 0, .95, 1], elev=25, azim=-115) # change parameters here for looking from a different perspective
ax.scatter(events['longitude'], events['latitude'], events['Days since 01/01/2007'], alpha=0.1)
fig.add_axes(ax)
plt.show()
Look at a selected single year in 3D (space + time):
fig = plt.figure(1, figsize=(9, 6))
events_1year=events[events['DateTaken: year']==2012] # Change the year to see the data for another year
ax = Axes3D(fig, auto_add_to_figure=False, rect=[0, 0, .95, 1], elev=25, azim=-115) # change parameters here for looking from a different perspective
ax.scatter(events_1year['longitude'], events_1year['latitude'], events_1year['Days since 01/01/2007'], alpha=0.1)
fig.add_axes(ax)
plt.show()
These colours could be used for up to 12 clusters, but density-based clustering often produces many more than that. In the following, the ColorBrewer colours are not used. Instead, colours for clusters are generated based on their positions in a spatialisation (2D projection) that reflects cluster similarities.
# 12-step qualitative color scale, courtesy of www.colorbrewer2.org
# These colours, however, are not used in the current code
clust_colors = ['#a6cee3','#1f78b4','#b2df8a','#33a02c','#fb9a99','#e31a1c','#fdbf6f',
'#ff7f00','#cab2d6','#6a3d9a','#ffff99','#b15928']
The following function getColor(...) will be used for a meaningful, similarity-aware assignment of colours to clusters. The idea is to represent the clusters as points in an artificial 2D space by applying a data embedding method (such as MDS, non-linear multidimensional scaling) to aggregate characteristics of the clusters, and then to generate colours for the clusters based on the positions of the corresponding points.
import math
def getColor (x, y, minX, maxX, minY, maxY):
wX=maxX-minX
wY=maxY-minY
rr=y-minY
cc=x-minX
#print(x,y)
if (wY < wX): #scale vertically, i.e. modify rr
rr *= wX/wY
else: #scale horizontally, i.e. modify cc
cc *= wY/wX
maxD=max(wX,wY)
rr1=maxD-rr
cc1=maxD-cc
#print(rr,cc,maxD,rr1,cc1)
dc=[math.sqrt(rr*rr+cc*cc),math.sqrt(rr*rr+cc1*cc1),math.sqrt(rr1*rr1+cc*cc),math.sqrt(rr1*rr1+cc1*cc1)]
weights=[0.0,0.0,0.0,0.0]
for i in range(len(weights)):
weights[i]=(maxD-dc[i])/maxD
if (weights[i]<0):
weights[i]=0
#print(dc,weights)
reds=[228,25,255,37]
greens=[220,228,18,13]
blues=[0,218,6,252]
dr=0
dg=0
db=0
for i,weight in enumerate(weights):
dr += weight*reds[i]
dg += weight*greens[i]
db += weight*blues[i]
# clamp the channels to the valid 0..255 range
dr = min(255, max(0, dr))
dg = min(255, max(0, dg))
db = min(255, max(0, db))
#print(weights,dr,dg,db)
c_string = '#{:02x}{:02x}{:02x}'.format(int(dr),int(dg),int(db))
return c_string
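The corner-blending idea can be illustrated with a condensed, self-contained sketch (a simplified variant of getColor, not the exact function above): each corner of the unit square carries one of the four anchor colours, and a point's colour is a proximity-weighted mix of them.

```python
import math

# Four anchor colours in the corners of the unit square,
# roughly matching the reds/greens/blues lists in getColor
CORNERS = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
RGB = [(228, 220, 0), (25, 228, 218), (255, 18, 6), (37, 13, 252)]

def blend_color(x, y):
    """Proximity-weighted blend of the four corner colours."""
    weights = [max(0.0, 1.0 - math.hypot(x - cx, y - cy)) for cx, cy in CORNERS]
    total = sum(weights) or 1.0
    channels = [int(sum(w * rgb[i] for w, rgb in zip(weights, RGB)) / total)
                for i in range(3)]
    return '#{:02x}{:02x}{:02x}'.format(*channels)

print(blend_color(0.0, 0.0))  # exactly the first corner colour: #e4dc00
print(blend_color(0.5, 0.5))  # a mix of all four corners
```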
from sklearn.cluster import DBSCAN as dbscan
import math
kms_per_radian = 6371.0088
spatial_dist_max = 20 / kms_per_radian
temporal_dist_max = 7
Both distance functions use the great-circle distance between two points specified by their coordinates. Note that the code is optimised to eliminate unnecessary computations of spatial distances.
def great_circle(lat1, long1, lat2, long2):
# Convert latitude and longitude to
# spherical coordinates in radians.
degrees_to_radians = math.pi/180.0
# phi = 90 - latitude
phi1 = (90.0 - lat1)*degrees_to_radians
phi2 = (90.0 - lat2)*degrees_to_radians
# theta = longitude
theta1 = long1*degrees_to_radians
theta2 = long2*degrees_to_radians
# Compute spherical distance from spherical coordinates.
# For two locations in spherical coordinates
# (1, theta, phi) and (1, theta, phi)
# cosine( arc length ) =
# sin phi sin phi' cos(theta-theta') + cos phi cos phi'
# distance = rho * arc length
cos = (math.sin(phi1)*math.sin(phi2)*math.cos(theta1 - theta2) +
math.cos(phi1)*math.cos(phi2))
if (cos > 1.0):
cos = 1.0
arc = math.acos( cos )
# Remember to multiply arc by the radius of the earth
# in your favorite set of units to get length.
return arc
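A quick sanity check of the spherical-law-of-cosines formula (re-implemented compactly here so the snippet is self-contained): the arc between Washington, DC and New York City, multiplied by the Earth's radius, should come out at roughly 330 km.

```python
import math

kms_per_radian = 6371.0088

def great_circle_arc(lat1, lon1, lat2, lon2):
    """Spherical law of cosines; returns the arc length in radians."""
    p1, p2 = math.radians(90 - lat1), math.radians(90 - lat2)
    t1, t2 = math.radians(lon1), math.radians(lon2)
    cos = min(1.0, math.sin(p1) * math.sin(p2) * math.cos(t1 - t2)
              + math.cos(p1) * math.cos(p2))
    return math.acos(cos)

# Washington, DC to New York City: roughly 330 km
dc, nyc = (38.9072, -77.0369), (40.7128, -74.0060)
print(round(kms_per_radian * great_circle_arc(dc[0], dc[1], nyc[0], nyc[1]), 1))
```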
def SpaceDistance(x,y):
try:
gc_dist = great_circle(x[1],x[0],y[1],y[0])
except ValueError:
gc_dist = np.inf
if (gc_dist>spatial_dist_max):
return np.inf
else:
return gc_dist
#return great_circle(x[1],x[0],y[1],y[0])
def SpaceTimeDistance(x,y):
#print('Params = {},{}'.format(spatial_dist_max,temporal_dist_max))
diff_days = math.fabs(x[2] - y[2])
if (np.isnan(diff_days) or diff_days > temporal_dist_max):
return np.inf
try:
gc_dist = great_circle(x[1],x[0],y[1],y[0])
except ValueError:
#print(x[1],x[0],y[1],y[0])
gc_dist = np.inf
if (gc_dist>spatial_dist_max):
return np.inf
ratio_t=diff_days/temporal_dist_max
ratio_d=gc_dist/spatial_dist_max
if (ratio_d>ratio_t):
return gc_dist
else:
return ratio_t * spatial_dist_max
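To see how the two thresholds combine, here is a small self-contained sketch of the combination rule (the raw spatial and temporal distances are passed in directly, so great_circle is not needed): if either threshold is exceeded the distance is infinite; otherwise the dominating component, rescaled to spatial units, is returned, so the combined metric behaves like a Chebyshev (maximum) distance over the two scaled components.

```python
import math

spatial_dist_max = 20 / 6371.0088  # radians
temporal_dist_max = 7              # days

def space_time_distance(gc_dist, diff_days):
    """Combine a spatial distance (radians) and a temporal gap (days)
    the way SpaceTimeDistance does, given the two raw distances."""
    if diff_days > temporal_dist_max or gc_dist > spatial_dist_max:
        return math.inf
    ratio_t = diff_days / temporal_dist_max
    ratio_d = gc_dist / spatial_dist_max
    # return the dominating component, expressed in spatial units
    return gc_dist if ratio_d > ratio_t else ratio_t * spatial_dist_max

print(space_time_distance(0.001, 10))  # inf: temporal gap exceeds 7 days
print(space_time_distance(0.001, 1))   # 0.001: the spatial component dominates
```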
The algorithm will find concentrations of photo-taking events in space and time based on the provided spatio-temporal distances between the events.
You are supposed to experiment with different parameter settings and investigate the impact of the clustering parameters on the results.
We have tested the following settings:
spatial_dist_max = 30 / kms_per_radian; temporal_dist_max = 3; n_neighbours = min_samples = 3
spatial_dist_max = 30 / kms_per_radian; temporal_dist_max = 7; n_neighbours = min_samples = 3
spatial_dist_max = 20 / kms_per_radian; temporal_dist_max = 7; n_neighbours = min_samples = 3
spatial_dist_max = 20 / kms_per_radian; temporal_dist_max = 7; n_neighbours = min_samples = 7
Please note: Clustering takes time; you need to wait until the text "Clustering finished!" is printed.
spatial_dist_max = 25 / kms_per_radian
temporal_dist_max = 7
n_neighbours = 7
from sklearn.cluster import DBSCAN
clustered_ST = DBSCAN(metric=SpaceTimeDistance,
min_samples=n_neighbours).fit(events[['longitude','latitude',
'Days since 01/01/2007']])
print("Clustering finished!")
labels=clustered_ST.labels_
unique_labels=np.unique(clustered_ST.labels_)
print('Result: {} records in the noise, labelled as -1, and {} clusters labelled as 0..{}'.
format(events[labels==-1].shape[0], len(unique_labels)-1, len(unique_labels)-2))
#clustered
clust_id_col_name='ClusterN'
events[clust_id_col_name]=labels
# Getting cluster sizes
cluster_sizes = events[clust_id_col_name].value_counts().rename_axis('Cluster id').to_frame('count')
print("Cluster sizes:")
print(cluster_sizes.head(15))
print("...")
print(cluster_sizes.tail(15))
cluster_sizes = cluster_sizes[cluster_sizes.index != -1] # no noise
max_cluster_size=cluster_sizes['count'].max()
print("max = ",max_cluster_size)
agg_func = {
'URL':'count',
'Days since 01/01/2007':['max','min'],
'longitude':['mean','max','min'],
'latitude':['mean','max','min'],
'DateTaken: day of year': ['mean','max','min']
}
st_aggregates = events.reset_index(drop=False)[['ClusterN','URL','Days since 01/01/2007',
'longitude','latitude',
'DateTaken: day of year']].groupby(['ClusterN']).agg(agg_func)
# Flatten hierarchical column names
st_aggregates.columns = ["_".join(x) for x in st_aggregates.columns.values.reshape(-1)]
# compute derived attributes: duration and bounding rectangle diagonal
st_aggregates['duration (days)']=st_aggregates['Days since 01/01/2007_max']-st_aggregates['Days since 01/01/2007_min']
for id,row in st_aggregates.iterrows():
brd=kms_per_radian*great_circle(row['latitude_max'],row['longitude_max'],row['latitude_min'],row['longitude_min'])
#print('{}'.format(brd))
st_aggregates.at[id,'Bound_rect_diag(km)']=brd
st_aggregates
We shall use spatialisation by means of the data embedding method MDS (non-linear multidimensional scaling) to obtain a 2D spatial arrangement of the clusters and subsequently generate colours for the clusters according to their positions in the projection. The similarity between clusters is assessed based on their spatial positions and temporal characteristics. The latter include the absolute times of cluster existence within the overall time period and the positions within the yearly cycle.
Please note: Before doing data embedding, we need to apply normalisation to the attributes characterising the clusters because the attributes have incomparable, highly different value ranges.
clusters_data = st_aggregates.loc[st_aggregates.index!=-1,
['latitude_mean','longitude_mean',
'Days since 01/01/2007_min','Days since 01/01/2007_max',
'DateTaken: day of year_min','DateTaken: day of year_max']]
#cl_attr.head()
scaler = MinMaxScaler()
clusters_data_scaled = scaler.fit_transform(clusters_data)
mds_ST = MDS(n_components = 2, random_state=110)
xy_mds_ST = mds_ST.fit_transform(clusters_data_scaled)
xmin_ST=xy_mds_ST[:,0].min()
xmax_ST=xy_mds_ST[:,0].max()
ymin_ST=xy_mds_ST[:,1].min()
ymax_ST=xy_mds_ST[:,1].max()
print(xmin_ST,xmax_ST,ymin_ST,ymax_ST)
plt.figure(figsize=(10,10))
plt.xlabel('Axis 1')
plt.ylabel('Axis 2')
plt.title('MDS projection of clusters')
colors = [(0,0,0)]
for i in range(len(xy_mds_ST)):
j=np.where(cluster_sizes.index==clusters_data.index[i])[0][0]
r=cluster_sizes.iat[j,0]/max_cluster_size
size=50 + 300*r
plt.scatter(xy_mds_ST[i,0], xy_mds_ST[i,1], alpha = .9, s = size,
c=getColor(xy_mds_ST[i,0], xy_mds_ST[i,1],xmin_ST,xmax_ST,ymin_ST,ymax_ST))
plt.text(xy_mds_ST[i,0]+0.0001*size, xy_mds_ST[i,1]+0.0001*size,
str(clusters_data.index[i])+": "+str(cluster_sizes.iat[j,0]), alpha = .25+.75*r)
plt.grid()
lon_range = (-130.60, -52.75)
lat_range = (17.13, 53.65)
m = fm.Map(tiles='cartodbpositron', width='100%', height='100%')
# If you adjusted the notebook display width to be as wide as your screen, the map might get very big.
#Adjust size as desired.
m.fit_bounds([[lat_range[0], lon_range[0]], [lat_range[1], lon_range[1]]])
for id, row in events.iterrows():
cluster_id = row[clust_id_col_name]
if cluster_id != -1 and len(np.where(clusters_data.index==cluster_id)[0])>0 :
i=np.where(clusters_data.index==cluster_id)[0][0]
if i<len(xy_mds_ST) :
fm.CircleMarker((row['latitude'], row['longitude']), radius=2,
#color=clust_colors[cluster_id % len(clust_colors)],
color=getColor(xy_mds_ST[i,0], xy_mds_ST[i,1],xmin_ST,xmax_ST,ymin_ST,ymax_ST),
fill=False, opacity=.1,
popup='Cluster: {}'.format(cluster_id)).add_to(m)
m
for id,row in events.iterrows():
cluster_id = row[clust_id_col_name]
if (cluster_id==-1) or len(np.where(clusters_data.index==cluster_id)[0])==0 :
color='#000000'
else:
i=np.where(clusters_data.index==cluster_id)[0][0]
color=getColor(xy_mds_ST[i,0], xy_mds_ST[i,1],xmin_ST,xmax_ST,ymin_ST,ymax_ST)
#color=clust_colors[cluster_id % len(clust_colors)]
events.at[id,'colors']=color
fig = plt.figure(1, figsize=(9, 6))
ax = Axes3D(fig, auto_add_to_figure=False, rect=[0, 0, .95, 1], elev=25, azim=-115) # change parameters here for experimenting
ax.scatter(events['longitude'], events['latitude'], events['Days since 01/01/2007'], c=events['colors'], alpha=0.1)
fig.add_axes(ax)
plt.title('All events; noise in grey')
plt.show()
eventsnn = events[events[clust_id_col_name] != -1] # no noise
fig = plt.figure(2, figsize=(9, 6))
ax = Axes3D(fig, auto_add_to_figure=False, rect=[0, 0, .95, 1], elev=25, azim=-115) # change parameters here for experimenting
ax.scatter(eventsnn['longitude'], eventsnn['latitude'], eventsnn['Days since 01/01/2007'], c=eventsnn['colors'], alpha=0.1)
fig.add_axes(ax)
plt.title('Cluster members only; no noise')
plt.show()
## Only noise in 3D; uncomment if wish to see
#events_noise = events[events[clust_id_col_name] == -1]
#print('noise data, shape (N rows, N columns) = {}'.format(events_noise.shape))
#fig = plt.figure(1, figsize=(9, 6))
#ax = Axes3D(fig, rect=[0, 0, .95, 1], elev=25, azim=-115) # change parameters here for experimenting
#ax.scatter(events_noise['longitude'], events_noise['latitude'], events_noise['Days since 01/01/2007'], alpha=0.1)
#plt.show()
Experiment with the spatial and temporal distance thresholds and the minimal number of neighbours. Return to the cell where the parameters are set and DBSCAN is run, change the parameters, and re-run that cell and the following cells. We suggest modifying only one parameter before each run of the clustering.
After excluding the noise from the result of the last spatio-temporal clustering, you will apply clustering according to the spatial distances between the events, ignoring time. Each spatial cluster will indicate a place where spatio-temporal concentrations of photo-taking events occurred.
For the spatial clustering, you are supposed to use the same settings for the spatial distance threshold and the minimum number of neighbours as you used in the latest spatio-temporal clustering.
eventsnn = events[events[clust_id_col_name] != -1].copy()  # .copy() avoids SettingWithCopyWarning when columns are added later
print('Data without noise, shape (N rows, N columns) = {}'.format(eventsnn.shape))
fig = plt.figure(1, figsize=(9, 6))
ax = Axes3D(fig, auto_add_to_figure=False, rect=[0, 0, .95, 1], elev=25, azim=-115) # change parameters here for experimenting
ax.scatter(eventsnn['longitude'], eventsnn['latitude'], eventsnn['Days since 01/01/2007'], alpha=0.1)
fig.add_axes(ax)
plt.show()
Reminder: Set the same values for the spatial distance threshold and the minimum number of neighbours as you used in the latest spatio-temporal clustering. With such settings, you can expect that no noise will be produced.
Please note: Clustering takes time; you need to wait until the text "Clustering finished!" is printed.
spatial_dist_max = 25 / kms_per_radian
n_neighbours = 7
clustered_S = DBSCAN(metric=SpaceDistance, min_samples=n_neighbours).fit(eventsnn[['longitude','latitude']])
print('Clustering finished!')
labels=clustered_S.labels_
unique_labels=np.unique(clustered_S.labels_)
print('Result: {} records in the noise, labelled as -1, and {} clusters labelled as 0..{}'.
format(eventsnn[labels==-1].shape[0], len(unique_labels)-1, len(unique_labels)-2))
clust_id_col_name='ClusterNN'
eventsnn[clust_id_col_name]=labels
#clustered
# Getting cluster sizes
cluster_sizes_S = eventsnn[clust_id_col_name].value_counts().rename_axis('Cluster id').to_frame('count')
print("Cluster sizes:")
print(cluster_sizes_S)
cluster_sizes_S = cluster_sizes_S[cluster_sizes_S.index != -1] # no noise
max_cluster_size_S=cluster_sizes_S['count'].max()
print("max = ",max_cluster_size_S)
agg_func = {
'URL':'count',
'Days since 01/01/2007':['max','min'],
'longitude':['mean','max','min'],
'latitude':['mean','max','min'],
'DateTaken: day of year': ['mean','max','min']
}
spatial_aggregates = eventsnn.reset_index(drop=False)[['ClusterNN','URL','Days since 01/01/2007',
'longitude','latitude',
'DateTaken: day of year']].groupby(['ClusterNN']).agg(agg_func)
# Flatten hierarchical column names
spatial_aggregates.columns = ["_".join(x) for x in spatial_aggregates.columns.values.reshape(-1)]
spatial_aggregates['duration (days)']=spatial_aggregates['Days since 01/01/2007_max']-spatial_aggregates['Days since 01/01/2007_min']
for id,row in spatial_aggregates.iterrows():
brd=kms_per_radian*great_circle(row['latitude_max'],row['longitude_max'],row['latitude_min'],row['longitude_min'])
spatial_aggregates.at[id,'Bound_rect_diag(km)']=brd
spatial_aggregates
Like previously for the spatio-temporal clusters, we shall use MDS for obtaining a 2D projection of the clusters and subsequent generation of colours for the clusters according to their positions in the projection.
Please note: Before doing data embedding, we need to apply normalisation to the attributes characterising the clusters because the attributes have incomparable, highly different value ranges.
s_cluster_data = spatial_aggregates.loc[spatial_aggregates.index!=-1,
['latitude_mean','longitude_mean',
'DateTaken: day of year_min','DateTaken: day of year_max']]
#s_cluster_data.head()
scaler = MinMaxScaler()
s_cluster_data_scaled = scaler.fit_transform(s_cluster_data)
mds_S = MDS(n_components = 2, random_state=110)
xy_mds_S = mds_S.fit_transform(s_cluster_data_scaled)
xmin_S=xy_mds_S[:,0].min()
xmax_S=xy_mds_S[:,0].max()
ymin_S=xy_mds_S[:,1].min()
ymax_S=xy_mds_S[:,1].max()
print(xmin_S,xmax_S,ymin_S,ymax_S)
plt.figure(figsize=(6,6))
plt.xlabel('Axis 1')
plt.ylabel('Axis 2')
plt.title('MDS projection of spatial clusters')
colors = [(0,0,0)]
for i in range(len(xy_mds_S)):
j=np.where(cluster_sizes_S.index==s_cluster_data.index[i])[0][0]
r=cluster_sizes_S.iat[j,0]/max_cluster_size_S
size=50 + 300*r
plt.scatter(xy_mds_S[i,0], xy_mds_S[i,1], alpha = .9, s = size,
c=getColor(xy_mds_S[i,0], xy_mds_S[i,1],xmin_S,xmax_S,ymin_S,ymax_S))
plt.text(xy_mds_S[i,0]+0.0001*size, xy_mds_S[i,1]+0.0001*size,
str(s_cluster_data.index[i])+": "+str(cluster_sizes_S.iat[j,0]), alpha=.5)
plt.grid()
m = fm.Map(tiles='cartodbpositron', width='100%', height='100%')
# If you adjusted the notebook display width to be as wide as your screen, the map might get very big.
# Adjust size as desired.
m.fit_bounds([[lat_range[0], lon_range[0]], [lat_range[1], lon_range[1]]])
for id, row in eventsnn.iterrows():
cluster_id = row[clust_id_col_name]
if (cluster_id != -1) and len(np.where(s_cluster_data.index==cluster_id)[0])>0 :
i=np.where(s_cluster_data.index==cluster_id)[0][0]
if i<len(xy_mds_S) :
fm.CircleMarker((row['latitude'], row['longitude']), radius=2,
#color=clust_colors[cluster_id%len(clust_colors)],
color=getColor(xy_mds_S[i,0], xy_mds_S[i,1],xmin_S,xmax_S,ymin_S,ymax_S),
fill=False, opacity=.5,
popup='Cluster: {}'.format(cluster_id)).add_to(m)
m
fig = plt.figure(1, figsize=(9, 6))
for id,row in eventsnn.iterrows():
cluster_id = row[clust_id_col_name]
if (cluster_id==-1) or len(np.where(s_cluster_data.index==cluster_id)[0])==0 :
color='#000000'
else:
i=np.where(s_cluster_data.index==cluster_id)[0][0]
color=getColor(xy_mds_S[i,0], xy_mds_S[i,1],xmin_S,xmax_S,ymin_S,ymax_S)
#color=clust_colors[cluster_id % len(clust_colors)]
eventsnn.at[id,'colors']=color
#eventsnn = events[events[clust_id_col_name] != -1] # no noise
ax = Axes3D(fig, auto_add_to_figure=False, rect=[0, 0, .95, 1], elev=25, azim=-135) # change parameters here for experimenting
ax.scatter(eventsnn['longitude'], eventsnn['latitude'], eventsnn['Days since 01/01/2007'],
c=eventsnn['colors'], alpha=0.1)
fig.add_axes(ax)
plt.show()
You aggregate and arrange the data so that you can conveniently see and compare the times and durations of cluster existence across places and years.
Please try to visualise these data, e.g., by bar charts.
agg_func = {
'URL':'count',
'DateTaken: day of year':['min','max']
}
st_aggregates = eventsnn.reset_index(drop=False)[['ClusterNN','DateTaken: year',
'URL','DateTaken: day of year']].groupby(['ClusterNN','DateTaken: year']).agg(agg_func)
# Flatten hierarchical column names
st_aggregates.columns = ["_".join(x) for x in st_aggregates.columns.values.reshape(-1)]
st_aggregates['duration (days)']=st_aggregates['DateTaken: day of year_max']-st_aggregates['DateTaken: day of year_min']
st_aggregates
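One possible starting point for such a bar chart: group the bars by year, one bar per cluster. The cluster numbers and durations below are made up for illustration; substitute the real values from the 'duration (days)' column of st_aggregates.

```python
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical durations (days) for three clusters over three years,
# standing in for st_aggregates['duration (days)'] unstacked by year
years = [2012, 2013, 2014]
durations = {0: [12, 9, 15], 1: [20, 18, 22], 2: [5, 8, 6]}

x = np.arange(len(years))
width = 0.25
fig, ax = plt.subplots(figsize=(8, 4))
for k, (cluster, vals) in enumerate(durations.items()):
    ax.bar(x + k * width, vals, width, label='Cluster {}'.format(cluster))
ax.set_xticks(x + width)
ax.set_xticklabels(years)
ax.set_xlabel('Year')
ax.set_ylabel('Duration (days)')
ax.legend()
plt.show()
```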
The same aggregation, now grouped by year first. Again, please try to visualise these data, e.g., by bar charts.
agg_func = {
'URL':'count',
'DateTaken: day of year':['min','max']
}
st_aggregates = eventsnn.reset_index(drop=False)[['ClusterNN','DateTaken: year',
'URL','DateTaken: day of year']].groupby(['DateTaken: year','ClusterNN']).agg(agg_func)
# Flatten hierarchical column names
st_aggregates.columns = ["_".join(x) for x in st_aggregates.columns.values.reshape(-1)]
st_aggregates['duration (days)']=st_aggregates['DateTaken: day of year_max']-st_aggregates['DateTaken: day of year_min']
st_aggregates
You now have two kinds of aggregates that allow you to answer the analytical questions about (1) local dynamics in clusters and (2) dynamics of the overall spatial situation. You may use, for example, contour plots for showing isochrones of season start in a single year; see https://matplotlib.org/mpl_toolkits/mplot3d/tutorial.html
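A sketch of such an isochrone view, using made-up cluster centres and season-start days in place of the real aggregates (tricontourf interpolates over scattered points, so no regular grid is needed):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
# Hypothetical cluster centres and season-start day-of-year for one year;
# in the exercise these would come from the cluster aggregates
lon = rng.uniform(-125, -70, 40)
lat = rng.uniform(25, 50, 40)
# Made-up trend: the season tends to start later further north
start_day = 60 + 3 * (lat - 25) + rng.normal(0, 4, 40)

fig, ax = plt.subplots(figsize=(8, 5))
tcf = ax.tricontourf(lon, lat, start_day, levels=10, cmap='viridis')
fig.colorbar(tcf, ax=ax, label='Season start (day of year)')
ax.set_xlabel('Longitude')
ax.set_ylabel('Latitude')
ax.set_title('Isochrones of season start (synthetic example)')
plt.show()
```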
PB_clustering.ipynb (notebook)
PB_clustering.html (HTML version)